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Abstract 

Consider the following three important problems in statistical inference, namely, constructing confidence inter¬ 
vals for (1) the error of a high-dimensional {p > n) regression estimator, (2) the linear regression noise level, and 
(3) the genetic signal-to-noise ratio of a continuous-valued trait (related to the heritability). All three problems 
turn out to be closely related to the little-studied problem of performing inference on the ^ 2 -norm of the signal in 
high-dimensional linear regression. We derive a novel procedure for this, which is asymptotically correct when the 
covariates are multivariate Gaussian and produces valid confidence intervals in hnite samples as well. The proce¬ 
dure, called EigenPrism, is computationally fast and makes no assumptions on coefficient sparsity or knowledge 
of the noise level. We investigate the width of the EigenPrism confidence intervals, including a comparison with a 
Bayesian setting in which our interval is just 5% wider than the Bayes credible interval. We are then able to unify 
the three aforementioned problems by showing that the EigenPrism procedure with only minor modifications is 
able to make important contributions to all three. We also investigate the robustness of coverage and find that 
the method applies in practice and in hnite samples much more widely than just the case of multivariate Gaussian 
covariates. Finally, we apply EigenPrism to a genetic dataset to estimate the genetic signal-to-noise ratio for a 
number of continuous phenotypes. 
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1 Introduction 

1.1 Problem Statement 

Throughout this paper we will assume the linear model 

y = Xf3 + e, (1.1) 

where y, e G M", /3 € and X G Denote the row and column of X by Xi and Xj, respectively. We 

assume the Xi are drawn i.i.d. from a mean-zero distribution with covariance matrix S. 

Our goal is to construct a two-sided confidence interval (Cl) for the expected signal squared magnitude 0^ := 
||S^/^/3||2 (or equivalently just 0). Explicitly, for a given significance level a G (0,1), we want to produce statistics 
La and Ua, computed from the data, obeying 


F{0^<La)<a/2, 

P (0^ > Ua) < a/2. 

In words, we want to be able to make the following statement: “with 100(1 — a)% confidence, 0^ lies between La 
and C/q,.” 

1.2 Motivation 

This problem can be motivated first from a high level as an approach to performing inference on f3 in high dimensions. 
Since p > n, we cannot hope to perform inference on the individual elements of (3 directly (without further assump¬ 
tions, such as sparsity), but there is hope for the one-dimensional parameter 0. Although 0 is not often considered a 
parameter of inference in regression problems, it turns out to be closely related to a number of well-studied problems. 

Suppose one has an estimator (3 for /3. Perhaps the most important question to be asked is: how close is (3 to /3? 
This question can be answered statistically by estimating and/or constructing a Cl for the error of that estimate, 
namely, ||/3 — /3||2- This is a fundamental statistical problem arising in many applications. Consider, for example, a 
compressed sensing (CS) experiment in which a doctor performs an MRI on a patient. In MRI, the image is observed 
not in the spatial domain, but in the frequency domain. If as many observations as pixels are made, the result is the 
Fourier transform (with some added noise) of the image, from which the original spatial pixels can be inferred. CS 
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theory suggests that one can instead use a number of observations (rows of the Fourier matrix) that is a fraction of 
the number of pixels, and still get very good recovery of the original image using perhaps sophisticated £i methods 
(Candfe et al. 2006). However, for a specific instance, there is no good way to estimate how “good” the recovery 


is. This can be important if the doctor is looking for a specific feature on the MRI, such as a small tumor, and 
needs to know if what he or she sees on the reconstructed image is accurate. In the authors’ experience, this is the 
most common question asked by end-users of CS algorithms. Put another way, when the Nyquist sampling theorem 
is violated, there is always a possibility of missing some of the signal, so what reassurances can we make about the 
quality of the reconstruction? 

The estimation of the noise level cr^ in a linear model is another important statistical problem. Consider, for 
example, performing inference on individual coefficients in the linear model. When n > p, OLS theory provides an 
answer that depends on or at least an estimate of it. Indeed, one can find in almost any introductory statistics 
textbook both estimation and inference results for cr^ in the case oin > p. However much recent work has investigated 


the problem of performing inference on individual coefficients in the high-dimensional setting oi n < p (Berk et al. 
2013[ [Lockhart et al.[ |2014 Taylor et al.[ 12014} [Javanmard and Montanari[ [^14[ [van de Geer et al.[ [2014[ Zhang] 


and Zhang 


2014 


Lee et al. 


2015), and they all require knowledge of a^. Unfortunately very few such results exist 
for the high-dimensional setting oi n < p. Beyond regression coefficient inference, can be useful for benchmarking 
prediction accuracy and for performing model selection, for instance using AIC, BIG, or the Lasso. It also may be 
of independent interest to know for instance to understand the variance decomposition of y. 

A third topic is the study of genetic heritability (Visscher et al. 2008), which can be characterized by the 


following question: what fraction of variance in a trait (such as height) is explained by our genes, as opposed to our 
environment? Colloquially, this can be considered a way of quantifying the nature versus nurture debate. 

It turns out that all three of these problems can be solved by connection with our original problem of estimating 
and constructing CIs for 0^. Indeed, in the MRI example, the doctor may split the collected observations into two 
independent subsamples, and and construct an estimator /3 from just (y^*^\ Then 

the vector y := y^^^ — X^^'>0 follows a linear model. 


= xW(/3-/§) + £, 


(1.3) 


so that if S = /, inference on 0 in this linear model corresponds exactly to inference on the £2 regression error of (3. 
Note that since the analysis is conditional on /3, there is no restriction on how (3 is computed from (y’'°\ and 

so the method applies to any coefficient estimation technique. We defer the connection between inference for and 
inference for cr^ and genetic variance decomposition to Section 


1.3 Main Result 

Although we will ultimately argue that our method applies more broadly, we will begin with the following distribu¬ 
tional assumptions, 

X, iV(0, Ip), e, • 7V(0, a^), (1.4) 

with X independent of e. Note that p > n ensures the design matrix will have a nontrivial null space, and thus 
conditional on X, the linear model <[I3 (including 9) is unidentifiable (since any vector in the null space of X can 
be added to £3 without changing the data-generating process). This necessitates a random design framework. The 
assumption of independence on the rows of the design matrix is often satisfied in realistic settings when observations 
are drawn independently from a population. However, the independence (and multivariate Gaussianity) of the 
columns is rather stringent and just a starting point—Sections |3.3[ |4.1[ and demonstrate in simulations and on 
real data that in practice EigenPrism achieves nominal coverage even when the marginal distribution of the entries 
of X are far from Gaussian, as well as in some cases when S 7 ^ /. We are treating the coefficient vector (3 as fixed, 
not random. 

Under these assumptions, we will develop in Section]^ an estimator that is unbiased for 9^, is asymptotically nor¬ 
mally distributed, and has an estimable tight bound on its variance. None of these properties, including estimability 
of the variance, require knowledge of the noise level or any assumption, such as sparsity, on the structure of the 
coefficient vector (3. From these results, it is easy to generate valid GIs for 9^ (or 9), and we will show that such GIs 
are nearly as short as they can be, and provide nominal coverage in finite samples under a variety of circumstances 
(even beyond the assumptions made here). 


1.4 Related Work 

When n > p, ordinary least squares (OLS) theor y gives us inf er ence for (3 an d thus also for 9. When n < p, 
the problem of estimating 9^ has been studied in Dicker (2014). Dicker (2014) uses the method of moments on 
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two statistics to estimate 6^ and cr^ without assumptions on /3, and with the same multivariate Gaussian random 
design assumptions used here. Dicker (2014) also derives asymptotic distributional results, but does not explore the 
estimation of the parameters of the asymptotic distributions, nor the coverage of any Cl derived from it. The main 
contribution of our work is to provide tight, estimable CIs which achieve nominal coverage even in finite samples. 

Inference for high-dimensional regression error, noise level, and genetic variance decomposition are each individu¬ 
ally well-studied, so we review some relevant works here. To begin with, many authors have studied high-dimensional 
regression error for specific coefficient estimators, such as the Las so ([Tibshirani 1996), often providing conditions 
under which this regression error asymptotes to 0 (see for example Bayati et al. (2013); Knight and Fu (2000|)). To 


our knowledge the only author who has considered inference for a general estimator is |Ward| ( |2009[), w ho does so 
using the Johnson-Lindenstrauss Lemma and assuming no noise, that is, e 

problem studied there is quite different from that addressed here, as we allow for noise in the linear model. 


1.1| ). Thus the 
Fur¬ 


thermore, because the Johnson-Lindenstrauss Lemma is not distribution-specific, it is conservative and thus Ward’s 
bounds are in general conservative, while we will show that in most cases our CIs will be quite tight. 

There has also been a lot of recent interest in estimating the noise level cr^ in high-dimensional regression problems. 
Fan et al. (2012) introduced a refitted cross validation method that estimates assuming sparsity and a model 


selection procedure that misses none of the correct variables. Sun and Zhang ( 2012 ) intro duced the scaled Lasso for 
estimating cr^ using an iterative procedure that includes the Lasso. Stadler et al. (2010) also use an lx penalty to 


estimate the noise level, but in a finite mixture of regressions model. Bayati et al. (2013) use the Lasso and Stein’s 
unbiased risk estimate to produce an estimator for . All of these works prove consistency of their estimators, 
but under conditions on the sparsity of the coefficient vector. Indeed, it can be shown (Giraud et al. 2012) that 
such a condition is needed when X is treated as fixed (which it is not in the present paper). Under the same 
sparsity conditions, Fan et al. (2012) and Sun and Zhang (2012) also provide asymptotic distributional results for 
their estimators, allowing for the construction of asymptotic CIs. What distinguishes our treatment of this problem 
from the existing literature is that our estimator and Cl for make no assumptions on the sparsity or structure of 

/3- 

An unpublished paper (Owen, 2012) estimates 6^ using a type of method of moments, with the goal of estimating 
genetic heritability by way of a variance decomposition. Although |Owen| gives conditions for consistency of his 
esimator, no inference is discussed, and he points out that the work is only valid for estimating heritability if the 
SNPs are assumed to be independent. In general, heritability is a well-studied subject in genetics, with especially 
accurate estimates coming from studies comparing a trait within and between twins (e.g. Silventoinen et al. (2003)). 
However, in order to better understand the genetic basis of such traits, some authors have tried to directly predict 
a trait from genetic information. Since most forms of genetic information, such as SNP data, are much higher¬ 
dimensional than the number of samples that can be obtained, the main approaches are either to try and find a 
small number of important variables through genome-wide association studies (e.g. Weedon et al. (2008)) before 
modeling, to estimate the kinships among subjects and use maximum likelihood, assuming independence among 
SNPs and random effects, on the trait covariances among subjects to estimate the (narrow-sense) heritability (e.g. 


Yang et al. (2010); Golan and Rosset (2011)), or to assume random effects and use maximum likelihood to estimate 


the signal-to-noise ratio in a linear model (e.g. Kang et al. (2008); Bonnet et al. (2014); Owen (2014)). However, 
attempts to explain heritability by genetic prediction have fallen quite short of the estimates from twin studies, 
leading to the famous conundrum of missing heritability (Manolio et al., 2009). Our main contribution to this 


field will be to consistently estimate and provide inference for the signal-to-noise ratio in a linear model, which 
is related to the heritability, without assumptions on the coefficient vector (such as sparsity or random effects), 
knowledge of the noise variance, or feature independence. This contribution may be especially valuable given the 
increased popularity of the rare variants hypothesis (Pritchard 2001) for missing heritability, which conjectures that 
the effects of genetic variation on a trait may not be strong and sparse, but instead distributed and weak (and their 
corresponding mutations rare). 

We note that neuroscientists have also done work estimating a signal-to-noise ratio, namely the explainable 
variance in functional MRI. That problem is made especially challenging due to correlations in the noise, making it 
different from the i.i.d. noise setting considered in this paper. For this related problem, Benjamini and Yu (2013) 
are able to construct an unbiased estimator in the random effects framework by permuting the measurement vector 
in such a way as to leave the noise covariance structure unchanged. 


2 Constructing a Confidence Interval for 6 ^ 

In this section we develop a novel method for constructing a valid Cl for 9^. This method does not require to 
be known. However, for pedagogical reasons, we begin with the simpler situation in which is known, which may 
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arise in many signal or image processing applications. 


2.1 Known cr^ 


Consider a sample of size n from the linear model (1.1). Then 


N{0,9^+a^), 


which implies 


Ml 


02+^2 ^ n - ( 2 . 1 ) 

Denote the quantile of the Xn distribution by Then when is known, a valid Cl can be obtained by 

setting 


— 


\yU 


u„ = 


\y\\l 


— (7 


^l-a/2 '^a/2 

that is, (1.2) is satisfied under this choice of Ua- Note that the method of |Ward|(|2009|) also assumes cr^ is known, 
and equal to zero, so we may consider comparing it to the above. In particular we want to emphasize that |Ward| 
(2009)’s inference method is conservative due to the generality of the Johnson-Lindenstrauss lemma, while [La, Ua] 
contitutes an exact 100(1 — a)% CL The same procedure can be generalized using the bootstrap on the unbiased 
estimator 

T,:=\\y\\l/n-a^ . (2.2) 

See Appendix 1^ for details. 


2.2 Unknown cx^ 

2.2.1 Theory 


Consider again the linear model (1.1) with assumptions (1.4), in particular that X has i.i.d. standard Gaussian 
elements. Recall that we assume n < p, and let X = UDV' be a singular value decomposition (SVD) of X, so 
that U is n X n orthonormal, D is n x n diagonal with non-negative, non-increasing diagonal entries, and V is p x n 
orthonormal. Let jz = U^y, and denote the diagonal vector of D by d. We emphasize that the singular values in D 
are arranged along the diagonal in decreasing order, so that di > 9,2 dn > 0- Then 


z = D{V^I3) 


U^e, 


and note that 


E{zf\d) =E 
= d^E 


-f 

{vjffj 


U'e 


d 


d 


+ 2d,E {VjI3U:^ ejd) -hE 


(c7.)= 


d 


= di9^lp + a^ 


where the third equality follows from the fact that in our model the columns of V are uniformly distributed on the 
unit sphere, and independent of d. 

To give some intuition for what follows, assume n is even and consider the expectation, conditional on d, of the 
difference between the sum of squares of the first half of the entries of z and the sum of squares of the second half 
of the entries of z, 

( nl2 n 

■i—1 i—n/2+1 

n2 , , 

^ 2=1 


nl2 

I = I H ■ 


:Cr - 


2=1 


E 

\A—n/2-\-l 


d^Oyp+^a^ 


Note that the terms containing cr^ in the first line cancel out, but because the singular values di of X are in decreasing 
order, a term proportional to 9^ remains. We generalize this idea below. 
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Let Xi = df/p be the eigenvalues of XX^ jp, let w G K" be a vector of weights (which need not be nonnegative), 
and consider the statistic S = We can compute its expectation, conditional on d, as 


E{S\ d)=E 





n n 

= ^ WiXi + ^ Wi. 

2=1 2=1 


(2.3) 


Based on this calculation, constraining Wi = 0 and = 1 makes S an unbiased estimator of 9“^ (even 

conditionally on d). We can also compute its conditional variance (see Appendix for a detailed computation), 


Var(5'|d) = 2(7"^ ^ wf + ^ w^Xi + 29‘^ 

i=l i=l 


E 

2=1 


2 \2 


wfX 


{YJl=lW^Xi) 


(2.4) 


which, under the aforementioned constraint '^i^i = 1 can be rewritten as 


= 2ct‘^ ^ wf + 4:a^9^ ^2 w^Xi + 29^ j 

i=l i=l 

n 

<2Y,wUX,9^ ^ 

2=1 

n 

= 2 (6»2 + cr^)^ wl (A,p + 1 - p)^ 


p + 2 







(2.5) 


where 



( 2 . 6 ) 


is the fraction of the variance of the yi accounted for by the signal (recall that Var(yi) = 9^ -\- u^). The inequality 
will be quite tight when p is large and ^ 1- By noting that this variance bound, as a function of p, is a 

quadratic equation with positive leading coefficient, it follows that it is maximized either at p = 0 or at p = 1. This 
leads to one more upper-bound. 


Var(S'|d) < 2 -|- cr^)^ • max ( ^ 


2 ^2 


^i=l 


2 \ ' z \ 

Wi , 2^ Wi A: 

i=l 


(2.7) 


The above equation has two striking features. The first is that it depends on 9^ and cr^ only through the sum 
9^ + cd, for which we have an excellent estimator given by The second feature is that it separates into 

the product of two terms: one term that does not depend on tu, and a second term that is known (in that it 
contains nothing that needs to be estimated) and (strictly) convex in w. Thus we can use convex optimization to 


find the vector w that minimizes the upper-bound (2.7) on the variance subject to the two linear equality constraints 
mentioned earlier, Y^^=iWi = 0 and 'Y^^=iWiXi = 1, which ensure that S remains unbiased for 9'^. Figureshows 
an example of such an optimized weight vector when n = 200 and p = 2000. Note that instead of just giving some 
positive weight to large A^’s and some negative weight to small A^’s, the optimal weighting is a smooth function of 
the Xi- This makes sense, as the Zi’s with large associated Xi have a larger signal-to-noise ratio, and should be given 
greater weight. Denote the statistic S constructed using these constrained-optimal weights by T 2 . Explicitly, let w* 
be the solution to the following convex optimization program Vp. 


/ n n \ n n 

argmin max > wfX^ such that > Wi = 0, > WiXi = 1 

■ujGK’' ' 


( 2 . 8 ) 


\i—l 2 = 1 / 2 = 1 2=1 

and denote by val('Pi) the minimized objective function value. Then the statistic for our main procedure in this 
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Figure 1: Plot of weights Wi as a function of normalized eigenvalues for n = 200 and p = 2000. 


paper, which we call the EigenPrism procedure, is the following, 

n 

i 

E(r2| d) =9 


* 2 
W, z, , 


i=l 

q2 


(2.9) 


SD(T2|d) < 


|y|li 


n 

where the only approximation in the variance is the replacement of 0^ + by its estimator 

With these calculations in place, we now define our (1 — a)-confidence interval for 0^, by assuming that T 2 follows 
an approximately normal distribution (discussed later on). We construct lower and upper endpoints 


L„ := max \ To — z 


^l-a/2 


■ v/2val(iPi) 

n 


:= To 


'■\-al 2 


V2val(iPi) 


!y|ll 


where the value of is clipped at zero since it holds trivially that 0^ > 0, and where z.\_^i^ is the (1 — ajT) quantile 
of the standard normal distribution. 

Remark. The idea of constructing the z^’s as contrasts has been used in the heritability literature before, 
e.g. Kang et al. (2008); Bonnet et al. ( 2014[ ); Owen (2014), but in a strict random effects framework. In particular, 
when the entries of /3 are i.i.d. Gaussian, the z^’s become independent. With independent z^’s whose distribution 
depends only on the signal (0^) and noise (tr^) parameters, the authors are able to apply maximum likelihood 
estimation, with associated asymptotic inference results for the signal, noise, or signal-to-noise ratio (we note that 
Bonnet et al. (2014) generalize such estimators somewhat to the case of a Bernoulli-Gaussian random effects model). 
The crucial difference between our work and theirs is that we make no assumptions (e.g., Gaussianity, sparsity) on 
the coefficient vector, and thus not only are the z^’s not independent in our setting, but their dependence (and thus 
the full likelihood) is a function of the products and thus a maximum likelihood approach in this setting would 
still be overparameterized. 

Next, we discuss the coverage and width properties of this constructed confidence interval. 


2.2.2 Coverage 

Now that we are equipped with an unbiased estimator and a computable variance (upper-bound), and have con¬ 
structed a confidence interval (Cl) using a normal approximation, there are two main questions to answer in order 
to determine whether these CIs will exhibit the desired coverage properties. In particular, we would like to know if 
substituting 0^ -|- with ||y|| 2 /?^ substantially affects the variance formula, and we would like to know if T 2 is ap¬ 
proximately normally distributed (so that we can construct arbitrary CIs from just the second moment). For the first 
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question, since HyH^ is a rescaled Xn random variable, for nominal coverage of 1 — a, the coverage actually achieved 
can be closely approximated by P {|-/V(0,1)| < Zi-a /2 ■ Xn/^} (where the N{0, 1) and the Xn independent), as¬ 
suming exact normality. Table [T] shows that for nominal 95% coverage, one would need fewer than 20 samples to 
achieve less than 90% coverage. For the second question, the following theorem establishes the asymptotic normality 


n 

10 

20 

50 

100 

500 

1000 

5000 

Coverage 

87.5% 

91.0% 

93.3% 

94.1% 

94.8% 

94.9% 

95.0% 


Table 1: Values of P {|A^(0,1)| < zq.qts • Xn/^} for ^ range of n. 


0 fT 2 . 


Theorem 1. Under the linear model (1.1) with Gaussian random design and errors given in Equation (1.4), the 
estimator T 2 as defined in Equation (2.9) is asymptotically normal as n,p —>■ 00 and n/p —>■ 7 € (0,1). This holds 
for any values of 9^ and cr^, including values that vary with n. Explieitly, 


Proof. The proof is given in Appendix [C| 


SD(r2|d) 


iV(0,l). 


For finite-sample results, we defer to the simulation results of Section [TT] to show that for problems of reasonable 


size (min{n,p} > 100), CIs constructed as if T 2 were exactly normal with variance exactly given by Equation (2.9) 
never result in below-nominal coverage. 


2.2.3 Width 

Once we have confirmed that our CIs provide the proper coverage, the next topic of interest is their widths. It is not 
hard to obtain a closed-form asymptotic upper-bound for Var(r 2 ) (the details are worked out in Appen dix [d| . In 
particular, letting V,, denote a random variable with Marcenko-Pastur (MP) distribution with parameter 7 (Marcenko 
and Pastui'l 1967), and AFy denote the median of Vy, define the constants 


Then in the limit as n,p 


Ay, =E [Vy • {lY.y>M.y ” 1 V., < M,^ ) ] , 

^7=E(Vy2). 

oo and njp —>■ 7 S ( 0 , 1 ), 


VVar(T2) 


6»2 -H, 


r 2 - 


< ^2 ■ 


max 


Ay ]■ 


( 2 . 10 ) 


( 2 . 11 ) 


We can draw a few conclusions from Equation (2.11). The most obvious is that for n,p —> 00 , n/p —> 7 S (0,1), 


asymptotically bounded above and 9^ asymptotically bounded below, the error of T 2 , as a fraction of its estimand 
02, converges to 0 in probability at a rate of Note that we make no assumptions at all on the structure of (3, 

and just require that 9^ does not asymptote at 0. The equation also lets us compute a conservative upper-bound 
on the asymptotic relative efficiency (ARE), defined as the asymptotic ratio of standard deviations (although it is 
often defined by variances elsewhere), of T 2 with respect to Ti from Section 2.1 (see (2.2)), the latter of which uses 
exact knowledge of cr^ and has standard deviation characterized by the Xn distribution. While we may not be able 
to formulate a closed-form expression for it in terms of expectations due to the constrained minimization functional. 


the standard deviation bound for T 2 in Equation (2.9) will also converge to a constant times SD(Ti) under the same 


asymptotic conditions, where the constant depends only on the MP distribution. This is because the optimal weights 
are a smooth function of the A^. Due to fast convergence to the MP distribution, we can numerically approximate 
this exact asymptotic ratio. Figure shows this estimate of the ARE of T 2 to Ti as a function of 7 . Note that the 
standard deviation bound for T 2 in Equation (2.9), used to compute the curve in Figure]^ is still an upper-bound for 
the ARE of T 2 with respect to Ty, but it reflects the ratio of Cl widths between the EigenPrism procedure and a Cl 
constructed from Ti with knowledge of a^. The figure demonstrates how close in width the EigenPrism procedure 
comes to an exact Cl for Ti which knows cr^. In particular, for 7 > 0.25, the EigenPrism CIs are at most twice as 
wide as those for Ti. 

Another notable feature of Figure is how large the ARE becomes as 7 —?► 0. This is a symptom of an important 
property of not just our procedure, but the frequentist problem as a whole. First, it is clear that if all the Xi = 1, our 
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Figure 2: Estimate of the asymptotic relative efficiency of T 2 to Ti. 


procedure fails, as the zf no longer provide any contrast between 9^ and cr^, and no linear combination of them will 
produce an unbiased statistic for 6*^. Intuitively, note that E [zf) oc Xip+ (1 — p), so that the problem of estimating 
p is that of estimating the slope and intercept of a regression line. But in regression, when the predictor variable 
assumes a constant value, as it would when Ai = 1, it becomes impossible to estimate the slope and intercept. To 
understand better how our procedure performs when the spread of the Xi approaches zero, consider the case when 
Ai = • • • = A „/2 = 1 + a and A„/ 2 +i = • • • = A„ = 1 — a. In this case SD(Ai) = a, and it is easy to show that 


val(7^i) 


l + a^ 
a^n 


so if ■y/nSD(Ai) —0, then a^n 0 and so val(7^i) —>■ 00. 

Returning to our original model in which X is i.i.d. iV(0,1), the A^’s will be approximately MP-distributed with 
parameter 7 = n/p. Figure shows visually how the width of the MP distribution depends on 7, and analytically, 
SD(yi),) = 1 ^ 7 . We show in the following theorem (proved in Appendix!^ that if the Ai’s are too close to 1 and 
n <C p, it is impossible for any procedure to reliably distinguish between the case of p = 0 (pure noise) and p = 1/2 
(variance equally split between signal and noise). 


Theorem 2. Let p>n> 1. Suppose that 


Z = d-DV^a + a-e 


( 2 . 12 ) 


where 0 > 0, ct > 0, and a unit vector a are all fixed but unknown, D G is a known nonnegative diagonal matrix 

with x:r=i := Er=i DUp = n,V & ^ is a random Haar-distributed orthonormal matrix, and e ^ Af(0,I„) 

independent of V. Consider the simple scenario where we are trying to distinguish between only two possibilities, 
denoted by distributions Pq and Pi: 


Pq : Z follows (2.12) with 6 = 0 and cr = 1, vs. Pi : Z follows (2.12) with 9 = a = 


V2' 


Then for any test ip : M" —>■ {0,1}, the power to correctly distinguish between these two distributions is 


bounded as 


Fz^Po mz) = 1] + Pz^p, [V-(Z) = 0] > 1 


/Er=i(A.-i)^ 



In other words, every test ip has high error, with 


/Er=i(A.-i)^ 



P(Type I error) + P(Type II error) > 1 
















X 


Figure 3: Probability density function (PDF) of Marcenko-Pastur distribution for various values of 7 . 


so that if the Xi are tightly distributed around 1 and n p, the problem of estimating p, and thus 6, is extremely 
difficult. Note that for approximately MP-distributed Xi with 7 = n/p « 0, both ~ 1)^ n/p are quite 

small, explaining the spike in ARE in Figure as 7 —>■ 0. 

Another way to evaluate how short the EigenPrism CIs are, compared to how short they could be, is to compare 


to a Bayesian procedure on a Bayesian problem. This is done in Section 3.1 


2.2.4 Computation 

As a procedure intended for use in high-dimensional settings, it is of interest to know how the EigenPrism proce¬ 
dure scales with large problem dimensions. There are essentially two parts to the procedure: the SVD, and the 
optimization (2.8 1 to choose w*. Due to the strict convexity of the optimization problem, it is extremely fast to 
solve (2.8) and in all of our simulations the runtime was dominated by the SVD computation. In Appendix [f| 


we 


include a snippet of Matlab code in the popular convex optimization language CVX (Grant and Boyd 2014 2008) 


that reformulates the optimization problem (2.8) as a second-order cone problem. Even if the optimization becomes 
extremely high-dimensional, note that the optimal weights w* are a smooth function of their associated eigenvalues 
Xi- Thus we can approximate w* extremely well by subsampling the Xi, computing a lower-resolution optimal weight 
vector, and then linearly interpolating to obtain the higher-resolution, high-dimensional w*. For the SVD, note that 
V never needs to be computed. Thus, the computation scales as n^p with a small constant of proportionality, as the 
SVD of XX^ is all that is needed. 


3 Derivative Procedures 

In this section, we go into more detail about the three related problems of performing inference on estimation error 
of a high-dimensional regression estimator, noise level in a high-dimensional linear model, and genetic signal-to-noise 
ratio, including simulation results. MATLAB code for the numerical results in this paper is available on the first 
author’s website. 


3.1 High-Dimensional Regression Error 

We have already shown in Section[l.2|that the problem of inference for high-dimensional regression error is equivalent, 


with a change of variables, to that of inference on 9. Under assumptions (1.4), our framework even allows for selection 
of a subset of 0, for instance if the doctor sees an anomaly in a region of the reconstructed image, he or she may 
only care about error in that region. In that case, for a subset of indices R (with corresponding complement 
Equation (1.3) can be rewritten as 


y = X W {(3r -0r) + (/3i^c - (3 j,.) + e = X^j^>(O r - 0r) + i, 


-( 1 ) 


( 1 )/ 
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Figure 4: 
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(a) Coverage of 95% EigenPrism and Dicker confidence intervals as a function of p tor p = 10'^ and 

14 


+ cr^ = 10^. (b) EigenPrism confidence interval, Dicker confidence interval, and Bayes credible interval widths as 
a function of n for p = 10^ and /3 sampled according to the Bayesian model in Equation (G.l). 


where s is an i.i.d. Gaussian vector independent of so that defining 9 = ||/3_r —/3 i?||2 puts this problem squarely 
into the EigenPrism framework, regardless of the fact that R may be chosen after observing (3 (recall that f3 was 
fitted on an independent subset of the data, {X. 

We note that the requirement that the columns of X be independent in order to perform inference on ||/3 — /3||| 
cannot be relaxed. However, with a known covariance S, one could instead perform inference on ||S^/^(/3 — /9)||i. 
Of course, inference for either \\^ — f3\W or ||S^/^(/3 — (3)\\2 is sufficient if the ultimate goal is to invert the GI to test 
a global null hypothesis on the coefficient vector. 

What remains to be seen then is (1) that coverage is not lost by approximating + cr^ by ||y ||2 and by assuming 
T 2 is normal, and (2) how short the resulting CIs are relative to how short they could be. To inves tigat e (1), we 
fixed p at 10^, = 10^, varied n on a log scale between 0 and p, and varied p (recall Equation 2.6) between 

0 and 1 by taking equally spaced values of log(p/(l — p)). Note that due to rotational symmetry, the direction of 
{3 is irrelevant. We ran 10"^ simulations of the EigenPrism procedure to generate 95% CIs and compared coverage 
across the settings in Figure [4(a)] We also simulated CIs using the results of Dicker ( 2014| ) by simply plugging in its 
estimators for 0^ and to its asymptotic variance formula (which depends on the exact parameters). Note that the 
EigenPrism CIs achieve at least nominal coverage in all cases, while the Dicker procedure is less reliable, especially 
for large p. One setting in which we see EigenPrism over-cover is when n ~ p and p « 1. This can be explained 
by the variance upper-bound for T 2 in Equation (2.5), which is tight when p is large and 1- 

Figure[^shows that, except when n « p, we indeed have {pJ2^=ii'^i ^ 1- 

To investigate (2), we simulated the EigenPrism procedure on a Bayesian model and compared the EigenPrism 
widths to those obtained by computing equal-tailed Bayes credible intervals (BCI) from a Gibbs-sampled posterior. 
The details of the Bayesian setup are given in Appendix]^ but the resulting Cl widths are summarized in Eigure [4(b)| 
for p = 10"*^ and a range of n. Again, we also compared to Dicker CIs. Each point on the plot represents 1000 
simulations. Although the Dicker CIs become slightly shorter than EigenPrism’s for large n, we note (as evidenced 
by Figure |4(a)| ) that this is exactly the regime in which the Dicker CIs have unreliable coverage. We will see later 
in Section [4.11 that even for small n and p, the Dicker CIs quickly lose coverage as correlations are added to the 
design matrix, while EigenPrism’s coverage is in fact quite robust. The other salient features of this plot are that 
the EigenPrism Cl widths decrease at a steady -^/n-rate, while the BCI widths start much lower and appear to 
asymptote around the EigenPrism Cl width curve. The fact that the BCI widths are much shorter for small n can be 
explained by the information contained in the priors, which is important for two reasons. In any frequentist-Bayesian 
comparison of methods, there is always the phenomenon that small n means the data contains little information, so 
the prior information given to the Bayesian method makes it heavily favored over the frequentist method. However, 
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Figure 5: Plot of the fraction ^ function of nj'p (on the log scale) for p = 10^. 


as we saw in Section 2.2.3 the frequentist problem is fundamentally limited not just by n but by SD(Ai) as well, 


and here since p is fixed, small n corresponds to small SD(A 2 ) as well, adding an extra layer of challenge for the 
EigenPrism procedure. As n increases though, the BCIs rely more heavily on the data, and come much closer in 
width to the EigenPrism CIs, with the average relative width increase bottoming-out at about 5% for n = 5000. The 


relative uptick in the EigenPrism Cl widths for n ~ p can again be explained by the upper-bonnd in Equation (2.51. 


3.2 Inference on 


We can use almost exactly the same EigenPrism procedure for cr^ as we did for 0^. Recall Equation (2.3) 

n n 

E (S'! d) ^ 9'^'^ WiXi + Wj. 


To make S unbiased for we constrained X]”=i= 0 = 1- However by switching these linear 

constraints, so that 1 make S unbiased for The variance formulae and 

upper-bounds in Equations (2.4|-(2.7) still hold, so that we can construct T 3 (and an associated Cl). Let w** be the 
solution to the following convex optimization program V 2 '- 


argmm max 

uj6R" 


E ’ E I E ~ E ~ ^ 


and denote by val( 7 ^ 2 ) the minimized objective function value. Then the EigenPrism procedure for performing 
inference on reads 


E(r3| d)=a^, 

SD(T3|d) < y/2val(lP2) — 


where again, the only approximation in the variance is the replacement of 9^ + cr^ by its estimator ||y|||/n. The 
analogue to Theorem holds and is proved in Appendix [C| 


Ts-a^ 

SB{n\d) 


7V(0,1). 
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Figure 6: Coverage of 95% EigenPrism and Dicker confidence intervals for cr^ as a function of p for p = 10^ and 
0^ + cr^ = 10^. Each point represents 10"^ simulations, and the grey line denotes nominal coverage. 
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Figure 7: (a) C overage and (b) width of scaled lasso, refitted cross validation, plug-in Cl from Dicker (20141 described 
in Section 3.1 and EigenPrism confidence intervals when n = 500, p = 1000, = 1, and non-zero entries of f3 equal 

to 1. Each point represents 10^ simulations, and the grey line denotes nominal coverage. 


Einally, as before, we construct the lower and upper endpoints to obtain an approximate (1 — q;)-CI for via 

^l-a/2 


:= max ( T 3 - • x/ 2 val(iP 2 ) ^, 0 ) , := Tg + 


Note that if the columns of X have a known covariance matrix S, the exact same machinery goes through by 
replacing X by and replacing /3 by 

Turning to simulations, we aim to show that the EigenPrism CIs for have at least nominal coverage. We take 
the same setup as in Figure 4(a) but instead construct 95% CIs for cr^. Figure]^ shows the result, and as before we 
see that EigenPrism’s coverage never dips below nominal levels in any of the settings, while for small p the Dicker 
CPs coverage can be unreliable, especially for large n. We performed a similar experiment with a Bayesian model 


12 

































to compare EigenPrism Cl widths for cr^ with those of equal-tailed BCIs, but found a less-desirable comparison than 
in the 9 case. In particular, the most favorable simulations showed the EigenPrism Cl approximately 30% wider 
than the BCI, which can likely be attributed to the more-informative prior (Inverse Gamma) on cr^ than that on 9^ 
(nearly Exponential) in the Bayesian model (G.l). Although we would have liked to try an Exponential prior for 
cr^, due to a lack of conjugacy the resulting Gibbs sampler was computationally intractable. We note that except in 
special cases, it can be very computationally challenging to construct BCIs, especially in high dimensions. 

We point out that only two other estimators in the literature provide any inference results, namely the scaled 
Lasso (Sun and Zhang 2012) and the refitted cross validation (CV) method of Fan et al. (2012). In particular, 
under some sparsity conditions on the coefficient vector, the authors find aymptotic normal approximations to their 
estimators. To compare our CIs with theirs, we compared them on the same simulations, but quickly found that 
scaled Lasso and refitted CV CIs only achieve nominal coverage in extremely sparse settings. We also compared the 
plug-in Cl for the estimator in Dicker (2014). This coverage comparison is shown in Figure [7 (a) [ The scaled Lasso 
CIs only achieve nominal coverage when 1 out of the 1000 coefficients are non-zero, and quickly drop off to less than 
half of nominal coverage by 1% sparsity. The refitted CV CIs undercover by about 10% even in the sparsest settings, 
and also fall off further in coverage as sparsity decreases. The EigenPrism and Dicker CIs achieve at least nominal 
coverage at all sparsity levels examined. Figure [7(b) | shows average Cl widths for the same simulations. The much 
smaller widths of the scaled Lasso and refitted CV CIs align with their lack of coverage, reflecting the fact that the 
bias and variance of their estimators can be poorly characterized in finite samples. The Dicker CIs are consistently 
wider than EigenPrism’s, with the inflation factor nearly 40% at the right-hand side of the plot. 


3.3 Genetic Variance Decomposition 

Consider a linear model for a centered continuous phenotype (j/i) such as height, as a function of a centered SNP 
array (xi). The variance can be decomposed as 

E{yf)=E[ixJf3f]+a^. (3.1) 


Under linkage disequilibrium, assuming column-independence is unrealistic. However, a wealth of genomic data has 


resulted in this column dependence possibly being estimable from outside data sets (e.g. 

Abecasis et al. 

(2012 


lid 

we may instead take Xi ' iV(0, S) with T known (we will discuss a relaxation of the 

normality in Section 

4.1) 


Then Equation (3.1) reduces to 


E{yf) = \\'E^^^(3\\l + a^ 
which provides a formula for the linear model’s signal-to-noise ratio. 


SNR = 


||Sl/2/3|i2+^2- 


The SNR is connected to the genetic heritability in that, for the simplified approximation to a linear model with 
additive i.i.d. noise, it quantifies what fraction of a continuous phenotype’s variance can be explained by SNP data. 
We note that there are many different definitions of heritability, and the SNR aligns most closely with the narrow- 
sense, or additive, heritability, as we do not allow for interactions or dominance effects. The extent of the connection 
between the two definitions depends on how complete the SNP array is—if every SNP is measured, they correspond 
exactly. 

Although until now we have been working with ||/3||2, while the SNR estimation problem seems to call for 
||S^/^/3||2, the above problem turns out to fit right into our framework. Explicitly, the linear model can be rewritten 
as 

y= (xS-i/2) +£, 


where now the rows of {XT: ^/^) are i.i.d. A^(0,/), and corresponds to the new quantity of interest: ||S^/^/3|j2. 
Since E (||y||^/n) = ||S% 2 / 3||2 -I- cr^ now, applying our methodology to XT gives a natural estimate for SNR, 
namely. 


SNR 


^2 

\\y\\^/n' 


Continuing, as we have done throughout this paper, to treat ||S ^/^^||2 -I- cr^ as if it is known and equal to Ijylli/’^, 
our distributional results for T 2 extend to give us an approximate confidence interval for SNR. 

We turn again to simulations to demonstrate the performance of the EigenPrism procedure described above for 
constructing SNR CIs. One major consideration is that of course, SNP data is discrete, not Gaussian. However, we 
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Figure 8: (a) Coverage, and (b) average widths, of EigenPrism SNR 95% confidence intervals. Experiments used 
n = 10^, p = 5x 10®, + = 10^, Bernoulli(O.Ol) design, and /3 with 10% non-zero, Gaussian entries. Each boxplot 

summarizes the (a) coverage and (b) width for 20 different f3’s, each of which is estimated with 500 simulations. 
The whiskers of the boxplots extend to the maximum and minimum points. The black dotted line in (a) is the 95% 
confidence lower-bound for the lowest whisker in each plot assuming all CIs achieve exact coverage, and the grey line 
shows nominal coverage. 


will show in Section |4.1| that the EigenPrism procedure works well empirically even under non-Gaussian marginal 
distributions. Here, we run experiments for n = 10®, p = 5 x 10®, 9^ + = 10^, Bernoulli(O.Ol) design with 

independent columns, (3 having 10% non-zero entries, and SNR varying from nearly 0 to nearly 1. Eigure shows 
the EigenPrism GI coverage and average widths. Note that although our CIs are conservative, we never lose coverage, 
and at worst our 95% Cl would give the SNR to within an error of ±6.8%. 


4 Robustness and 2-Step Procedure 

In this section we follow up our investigation of the EigenPrism framework by considering its robustness to model 
misspecihcation and presenting a 2-step procedure that can improve the Cl widths of the vanilla EigenPrism proce¬ 
dure. 


4.1 Robustness 


An important practical question is how robust the EigenPrism Cl is to model misspecihcation. In particular, our 
theoretical calculations made some fairly stringent assumptions, and we explore here their relative importances. 
Some standard assumptions that we rely on are that the model is indeed linear and the noise is i.i.d. Gaussian and 
independent of the design matrix. These assumptions are all present, for instance, in OLS theory, and we assume 
that problems substantially deviating from satisfying them are not appropriate for our procedure. As explained in 


Section 1.2 the random design assumption is necessitated by the high-dimensionality (p > n) of our problem, and 


within the random design paradigm, the assumption of i.i.d. rows is still broadly applicable, for instance whenever 
the rows represent samples drawn independently from a population. 

The not-so-standard assumption we make is that the columns of X are also independent, and all of JA’s entries 
are iV(0,1) (note that each column of a real design matrix can always be standardized so that at least the hrst two 
marginal moments match this assumption). These assumptions are important because they ensure that the columns 
of V are uniformly distributed on the unit sphere, so that we can characterize both the expectation and variance of 
their inner product with /3. Although we will see that the marginal distribution of the elements of X is not very 
important as long as n and p are not small, in general the independence of the columns is crucial. We note that 
there is work in random matrix theory showing that for certain random matrices which are not i.i.d. Gaussian, the 
eigenvectors are still in some sense asymptotically uniformly distributed on the unit sphere (see for example Bai 


et al. (2007)). This suggests that EigenPrism CIs, at least asymptotically, may work well in a broader context than 
shown so far. 
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Before explaining further, we feel it is important to recall that for two of the three inference problems this work 
addresses (inference for and signal-to-noise ratio), the EigenPrism procedure extends to easily account for any 
known covariance matrix among the columns of X. However in the vanilla example of simply constructing CIs for 0^, 
correlation among the columns of X can cause serious problems. To first order, we need E (|| V^/SIHI d) « O^njp, or 
else T 2 will be biased and the resulting shifted interval will have poor coverage. From a practical perspective, unless 
(3 is adversarially chosen, it may seem unlikely that (3 will be particularly aligned or misaligned (orthogonal) to the 
directions in which X varies. In particular, if we make a random effects assumption and say that the entries of f3 
are i.i.d. then the EigenPrism procedure will achieve nominal coverage. A slightly more subtle problem 

occurs if /3 is chosen not adversarially, but sparse in the basis of X’s principal components. In this case, although 
T 2 is approximately unbiased, the variance estimate could be far too small, resulting again in degraded coverage. 

To investigate how wrong the model has to be to make our CIs undercover, we construct EigenPrism CIs on 
data coming from models not satisfying our assumptions. In particular, we ran the EigenPrism procedure on design 
matrices with either i.i.d. entries with very different higher-order moments than a Gaussian, i.i.d. entries that were 
sparse, or Gaussian entries and correlated columns. Since the direction of (3 becomes relevant in all these cases, 
we performed experiments with both dense and sparse (3, and in each regime measured coverage for 20 different 
f3’s. The results of simulations with n = 10^, p = lO'^, and 9'^ + = 10'^ are plotted in Figure]^ Each boxplot 

summarizes the coverage for 20 different /3’s, each of which is estimated with 500 simulations. The whiskers of the 
boxplots extend to the maximum and minimum points, and the black dotted line is the 95% confidence lower-bound 
for the lowest whisker in each plot assuming all CIs achieve exact coverage. As can be seen from Figures 9(a)| and 


9(c) 


when (3 is dense, the marginal moments and sparsity of the entries of X do not affect coverage. Figures 


9(e) 


and 9(f)| show that even small unaccounted-for correlations among the columns of X do not greatly affect coverage, 
although larger correlations, as expected, can result in serious undercoverage for certain /3’s. As a comparison, we 
also simulated the Dicker CIs in the setting of Figures 9(e) and |9(f)[ wherein coverage never exceeded 40% for any 
/3 or correlation structure. Figures [9(b)| and |9(d)| show that when (3 is sparse, coverage is much more sensitive to 
sparsity in X, although if X is not sparse, coverage remains robust to higher-order moments of the design matrix. 
Figure [T0| demonstrates the crucial difference when X is sparse by showing a few realizations of quantile-quantile plots 
comparing the distribution of the entries of Vi to a Gaussian distribution, for Bernoulli(O.I)- and Bernoulli(O.OOI)- 
marginally-distributed X. The figure shows that the distribution for Bernoulli(O.l) is very nearly Gaussian, but 
that this is far from the case for Bernoulli(O.OOI), and thus it is the problem described at the end of the preceding 
paragraph that causes problems. 


4.2 2-Step Procedure 


Note that in the variance upper-bound of Equation (2.7), the unknown p is maximized over to remove it from the 
equation. This leads not only to conservative CIs, but suboptimal w* as well, since w* are obtained by minimizing 
this upper-bound, as opposed to the more accurate function of p. However by the end of the EigenPrism procedure, 
we have produced estimates of both 9 and -I- cr^, suggesting the possibility of a 2-step plug-in procedure to remove 


the need for the upper-bound in Equation (2.7). Explicitly, in the first step, we run the EigenPrism procedure to 
obtain an estimate p = 72/(|jy|||/n) of p. In the second step, we re-run the procedure treating p = p as known, and 


thus minimize the bound (2.5) to compute w*. Although the 2-step procedure indeed produces shorter CIs than the 


EigenPrism procedure, it does not achieve nominal coverage with the same consistency, as shown in Figure [TT| 
There are two particularly surprising aspects of this plot. The first is that the 2-step procedure produces substan¬ 


tial gains in width even for p values near 0 and 1. This is surprising because the upper-bound (2.7) that is eliminated 
by the 2-step procedure is tight when p is nearly 0 or I, however it is still not exact. The slightly loose variance 
upper bound turns out to have an optimizing w that is substantially different from the exact variance formula. The 
second surprising feature is that the width improvement is in fact smallest for p not near the endpoints 0 or 1. This 
can be explained by the clipping at 0. For p « 0, most CIs, both EigenPrism and 2-step, are cut nearly in half by 
clipping, so the fractional width improvement achieved by the 2-step procedure is fully realized. For p « I, both 
intervals are rarely clipped, and again the 2-step procedure realizes its full width improvement. However, for p not 
close to 0 or 1, many EigenPrism CIs are only slightly shrunk by clipping, so that the shorter 2-step intervals shorten 
the right side of the interval but leave the undipped left side about the same, so that much less than the full width 
improvement is realized. 

Although the 2-step procedure can provide substantial gains in width, it loses the robustness of the EigenPrism 
procedure, as shown in the slight undercoverage for n = 100 and the substantial undercoverage for large p and n = p. 
Therefore, in practice, we recommend use of the 2-step procedure instead of the EigenPrism procedure when n ^ p 
or when the statistician is confident that p is not close to 1. 
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Figure 9: The first column of plots ((a), (c), (e)) generates Pi N(0, 1) and renormalizes to control 0^, while the 
second column of plots ((b), (d), (f)) does the same but then sets 99% of the Pi to zero before renormalizing. The 
first two rows of plots ((a), (b), (c), (d)) use Xij i.i.d. from some non-Gaussian distribution renormalized to have 
mean 0 and variance 1. The third row of plots ((e), (f)) uses marginally standard Gaussian X but with correlations 
among the columns; see Appendix for detailed constructions. See text for detailed boxplot constructions and 
interpretation of the dashed line. 
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(a) (b) 

Figure 10: Quantile-quantile plots measuring the Gaussianity of 5 realizations of the entries of Vi for (a) 
Bernoulli(0.1)-distributed X and (b) Bernoulli(0.001)-distributed X. 
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Figure 11: (a) Coverage and (b) width relative to EigenPrism of 2-step confidence intervals for p = 10^ and -bcr^ = 
10'^ across a range of p and n, with each point representing 10'^ simulations. Grey lines show (a) nominal coverage 
and (b) reference ratio of 1. 


5 Variance Decomposition in the Northern Finland Birth Cohort 


We now briefly show the result of applying EigenPrism to a dataset of SNPs and continuous phenotypes to perform 
inference on the SNR = |jS^/^^||2/(||S^/^/3||2 -I- cr^). The data we use comes from the Northern Finland Birth 
Cohort 1966 (NFBC1966) ( Sabatti et al.[ [2509{ [Jarvelin et al. 2004), made available through the dbGaP database 
(accession number phs000276.v2.pl). The data consists of 5402 SNP arrays from subjects born in Northern Finland 
in 1966, as well as a number of phenotype variables measured when the subjects were 31 years old. After cleaning 
and processing the data (the details of which are provided in Appendix |l]) , 328,934 SNPs remained. The resulting 
5402 X 328,934 design matrix X contained approximately 58% O’s (homozygous wild type), 34% I’s (heterozygous), 
and 8% 2’s (homozygous minor allele). 

In order to use EigenPrism directly, we would need to know S, as simply using Ip presents two possible problems: 
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Figure 12: Distribution of the entries of some eigenvectors of the NFBC1966 design matrix. 


(1) If X is not whitened before taking the SVD, the columns of V may be far from Haar-distributed, rendering 
our bias and variance computations incorrect. 

(2) If S = Ip, then the ostensible target of our procedure is H/SHi/(ll/^lli + which may differ substantially from 
SNR = ||S1/2/3||2 /(||si/2/3||2 + 


Unfortunately, the problem of estimating the covariance matrix of a SNP array is extremely challenging (and the 
subject of much current research) due to the fact that n ^ p, even if we use outside data, so we prefer to avoid it 
here. In order to simply treat the covariance matrix as diagonal, we must consider the two problems above. There 
is a widely-held belief that the SNP locations that are important for any given trait are relatively rare (see, for 
example, Yang et al. (2010); Golan and Rosset] ( |2011 )), and thus spaced far enough apart on the genome to be 
treated as independent. This precludes problem (2) above, since with nonzero coefficients spaced far apart, we have 
« II/ 3 II 2 take the columns of X to be standardized, so the diagonal of S is all ones). For problem (1), 
we know that far apart SNPs are very nearly independent, so we may expect that the true S is roughly diagonal, and 
we already showed in Section [4.1| that the EigenPrism procedure is robust to some small unaccounted-for covariances 
when constructing CIs for \\(i\\\. To ensure that problems (1) and (2) do not cause EigenPrism to break down, we 
perform a series of diagnostics before applying it to the real data. 

Given the approximation of S as diagonal, we first performed a series of simulations to ensure EigenPrism’s accu¬ 
racy was not affected. Specifically, we ran the EigenPrism procedure (with adjustments described in the paragraph 
below) on artificially-constructed traits, but using the same standardized design matrix X from the NPBGI966 data 
set. Eor 20 different (3 vectors, we generated 500 independent Gaussian noise realizations and recorded the coverage 
of 95% EigenPrism GIs for SNR. The noise variance was I, and the /3’s were chosen to have 300 nonzero entries with 
uniformly distributed positions and all nonzero entries equal to •\/0.3/[(I — 0.3) • 300] (so that SNR = 0.3 if S = Ip). 
Table shows the coverage over the 20 (H's, and they are indeed all quite close to 95%, even though this simulation 
was conditional on X. Recomputing the target SNR using other estimates of S, such as hard-thresholding the 
empirical covariance at 0.1, changed the value of SNR very little, so that coverage was largely unaffected. 


Coverage 

90% 

91% 

92% 

93% 

94% 

95% 

96% 

97% 

98% 

99% 

100% 

Count 

1 

0 

0 

2 

1 

0 

8 

7 

1 

0 

0 


Table 2: Goverage of 20 SNR 95% GIs constructed for simulated traits using the NPBG1966 design matrix. Each 
coverage is an average over 500 simulations. 

A second diagnostic was to examine the columns of V to check for Gaussianity related to the phenomenon 
mentioned in Section |4.1| Indeed, we find that some of the columns of V are quite non-Gaussian, as shown in 
Figure [T^ However, this phenomenon is localized to only the columns of V corresponding to the very largest A^. 
Applying the unaltered EigenPrism procedure could cause two problems. Eirst, if the the first columns of V are 
not Haar-distributed, T 2 could be biased and/or higher-variance than our theory accounts for. Second, recalling 
the interpretation of EigenPrism as a weighted regression of zf on Ai, the fact that the problematic eigenvectors 
correspond to the largest eigenvalues means that they have high leverage, which exacerbates any unwanted bias 
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or variance they create. Luckily, both problems can be remedied by running the EigenPrism SNR-estimation 
procedure (with S = /p after standardizing the columns of JC) with the added constraint to the optimization 
program in Equation (2.8) that the hrst entries of w are equal to zero. Explicitly, as the non-Gaussianity of the 
columns of V appears to dissipate after around the 100*^ column, we set iCi = • • • = Wioo = 0. The choice of 100 is 
somewhat subjective, but we tried other values and obtained very similar results. Because the resulting weights still 
obey the original constraints, the estimator of ||/9||2 remains unbiased and the variance upper-bounds remain valid. 
Although motivated by the diagnostics from Section O this adjustment has the added advantage of making the 
entire EigenPrism procedure completely independent of the hrst 100 rows of U. It has been shown that the hrst rows 
of U are strongly related to the population structure of the sample (for example, the hrst two principal components 
correspond closely with subjects’ geographic origin), so constraining the hrst weights to be zero has the added effect 
of controlling for population structure (Price et al.[ 2006). As a hnal note, by subtracting off the means of each 


column of X, we reduced Ai’s rank by one, resulting in A„ = 0. As this is not actually rehective of the distribution 
of X, we also force Wn = 0 so that the last column of V and last row of U do not contribute to our estimate or 
inference. 

Encouraged by the simulation results from Table we proceeded to generate EigenPrism CIs for the SNRs of 


the 9 traits analyzed in Sabatti et al. (2009), as well as height (these 10 traits were also analyzed in Kang et al. 


(2010)). For each trait, transformation and subject exclusion was performed before computing SNR, following closely 
the procedures used in Sabatti et al. (2009); Kang et al. (2010) (see Appendix |T] for details). Lastly, all non-height 
phenotype values were adjusted for sex, pregnancy status, and oral contraceptive use, while height was only adjusted 
for sex. Tablej^gives the point estimate and 95% Cl for the SNR of each phenotype, as well as the number of subjects 


Phenotype Name 

# Samples 

SNR 95% Cl (%) 

Point Estimate (%) 

Triglycerides 

4644 

[3.1, 29.3] 

16.2 

HDL cholesterol 

4700 

[17.1, 42.9] 

30.0 

LDL cholesterol 

4682 

[27.7, 53.6] 

40.7 

C-reactive protein 

5290 

[5.6, 28.8] 

17.2 

Glucose 

4895 

[4.0, 28.9] 

16.5 

Insulin 

4867 

[0.0, 21.5] 

9.0 

BMI 

5122 

[8.9, 32.8] 

20.9 

Systolic blood pressure 

5280 

[7.8, 31.0] 

19.4 

Diastolic blood pressure 

5271 

[7.4, 30.7] 

19.0 

Height 

5306 

[46.0, 69.1] 

57.6 


Table 3: CIs for heritability estimates for each of the 10 continuous phenotypes considered, along with the number 
of samples used for each. 


used. Recall that these are CIs for the fraction of variance explained by the linear model consisting of the given array 
of SNPs. Still, these CIs generally agree quite well with heritability estimates in the literature (Kang et al. 2010|. 


For instance, (Kang et al. 2010, Supplementary Information) reports two “pseudo-heritability” estimates of 73.8% 


and 62.5% for height, and 27.9% and 24.2% for BMI, on the same data set. This is somewhat remarkable given that 
they use a completely different statistical procedure with different assumptions. In particular, while other works in 
the heritability literature tend to treat f3 as random, EigenPrism was motivated by a simple model with /3 fixed and 
the rows of X random. We find this model more realistic, as true genetic effects are not in fact random, but fixed. 
One could argue the difference is not too important as long as the genetic effects are approximately distributed as 
the random effects model chosen, but such an assumption is impossible to verify in practice, as the true effects are 
never observed. EigenPrism’s assumptions, on the other hand, are all on the design matrix, which is fully observed, 
leading to checks and diagnostics that can be performed to ensure the procedure will generate reasonable CIs. 


6 Discussion 

We have presented a framework for performing inference on the f 2 “norm of the coefficient vector in a linear regression 
model. Although the resulting confidence intervals are asymptotic, we show in extensive simulations that they 
achieve nominal coverage in finite samples, without making any assumption on the structure or sparsity of the 
coefficient vector, or requiring knowledge of a^. In simulations, we are able to relax the restrictive assumptions 
on the distribution of the design matrix and gain an understanding of when our procedure is not appropriate. 
Applying this framework to performing inference on £2 regression error, noise level, and genetic signal-to-noise ratio. 
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we develop new procedures in all three that are able to construct accurate CIs in situations not previously addressed 
in the literature. 

This work leaves open numerous avenues for further study. We briefly introduced a 2-step procedure that 
provided substantially shorter CIs than the EigenPrism procedure, but had less-consistent coverage. If we could 
better understand that procedure or come up with diagnostics for when it would undercover, we could improve on 
the EigenPrism procedure. We also explored in simulation a number of model failures that our procedure was robust 
(or not) to, but further study could provide theoretical guarantees on the coverage of the EigenPrism procedure for 
a broader class of random design models. Section [4T1 also briefly alluded to improved robustness in a random effects 
framework, which we have not explored further here. Finally, although in this work we consider a statistic that is 
linear in the , the framework and ideas of this work are not intimately tied to this restriction, and there may exist 
statistics that are nonlinear functions of the zf that give improved performance. 
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A Inference for under non-Gaussian design with known variance 

The method of Section [2.1 [also works asymptotically under more general conditions than the Gaussianity assumptions 


of (1.4). Let 2 : ~ (/i,, S) denote the statement that 2 ; has some distribution with mean pi and covariance matrix S. 
Consider again the linear model (1.11 but with relaxed assumptions, 


i.i.d. . , -p. 

X, ~ {n,Ip-np, ), 


e. (0,a^), 


again with cr^ known and X independent of e. Under this model, we get that 

2 i.i.d. ,„2 I 2 \ 

y, - {9 + a ,vi) 


and the asymptotic distribution in (2.1) in turn becomes, by the CLT, 


1 


-^WvWl - Vn{9'^ + cr^) N{0, vi), 


(A.l) 
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as n —)■ 00 , where vi does not depend on n but does depend on the unknown f3, and is given by 


vi=E{et)+4a^ 6 ^ - (3^ + 4E (ef)/x^/3 + E {xj (3)' 


In order to be less parametric, we can consider bootstrap confidence intervals based on the above calculations. 
Corresponding to (lA.ll) we can get an unbiased statistic, 


T,-.= hy\\l-^\ 

n 

E(ri) = 6|2, (A-2) 

SD(ri) = ^Jvitn, 


whose distribution we may hope to be close to Gaussian. Ti can be bootstrapped (potentially with standard bias- 
correction and acceleration) to obtain bootstrap CIs, nonparametrically dealing with the unknown variance vi. We 
ran simulations with n = 800, p = 1500, X having i.i.d. Bernoulli(0.05) entries (the columns of X were then 
standardized to have mean 0 and variance 1), 0^ = cr^ = 10, and £i i.i.d. (rescaled to have variance 10). We 
generated a single (3 uniformly on the 0-radius sphere and ran 1000 simulations (so that /3 did not change across 
simulations). Bias-corrected, accelerated 95% bootstrap CIs achieved 93.8% coverage (this is within statistical 
uncertainty of the nominal 95%, as a 95% Cl for the Cl coverage is [0.923,0.953]). 


B Calculation of variance of EigenPrism estimator 

In this section we calculate the variance of the statistic S = Er=i when conditioning on d. Here we treat w as 
fixed, but note that since we condition on d, this includes values of w that are calculated as a function of d, as in 
the EigenPrism method. 


Var(5'|d) = Var ^ 


Wizf 


n n 

i—l ^,i = i 


We now calculate each term. Recall that Xi := for i = 1,..., n. Then 

Var (zf I d) = E{zf\ d) - E{zf\ df 

= E[{d,{V,,f3)+e,f\d] - {X,e‘^ + a‘^f 

Using Ci ^ iV(0,CT^) (and the fact that e X V), 


= E[{d,{V,,l3)f\d] +6a2E[(d,(V„^))2|d] +3a4- (^02 + ^2)" 

Using (V„^)2^02.Beta(i,2^), 

= dje* - ] ^ + 6CT^Ai0^ -f 3 ct^ - (A,02 + 

P-[P + 2) 

= 2A204^^ -H 4cr2Ai02 + 2 a'^ 

* p + 2 

Also, for i ^ j, 

Cov (zf,z'^\d) = E ( zfzj I d) — E (zf | d)E (zj| d) 

= E[{d,{V^,f3) + e,fidj{Vj,/3) + ejf\d] - + a^) {Xj8^ + a^) 
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Using ei,ej N{0,a'^) (and the fact that e X V), 

= E{d^d]{V„(3)HV„(3)^\ d)+a^E{dj{V,,f3f\ d) + a^E {d^^{V„(3^] d) 

+ — {Xi 6 ^ + a^) {XjO^ + a^) 

Using {V„f3y ^ 02 . Beta (i, £^), 

= E{d^d]{V„f3f{Vj,f3f\ d) +a^X,e^ + a^X,9^ +a^ 

- (A,02 + ^2) (Aj02 + ^2) 

= d2d2E((F„/3)2(F^-,/3)2| d)-x,x,9^ 

= d^,d^^Cov{{V„(3f,{V„(3f\d) 


Using ((y„/3)2, (y„/3)2,02 - (y„/3)2 _ (y ,/3)2) ^ 02 . Dirichlet (i, i, 2^), 


-2 

p + 2 


X,Xj9*. 


Then, 


Var {S\d) = ^ wf f 2X^9 "^-—^ + Aa^Xi9^ + 2ct^ j + ^ WiWj f — ^XiXj9‘^ 


-2 


= H 40-2Ai02 + 2 ct^ j ^ ^ Wiu;^- ( ;;7-^AiAj0^ 


P + 2 


= 2a'^ ^ w1 + Aa^9‘^ ^ w^Xt + 29* 


P 


p + 2 




(Er.i»-.A.)‘ 

P + 2 


C Proof of Asymptotic Normality of T 2 and T'^ 


Proof. First consider T 2 {y,X) as a deterministic function of the random y and X. Then for any constant c, 

T2{cy,X)=c^T2{y,X). (C.l) 

Note that cy also follows a linear model, only with 9^ replaced by c^9'^ and replaced by c 2 cr 2 . Thus by taking 
c = 1/ max{0, a} we may treat 9^ and as belonging to [0,1] in order to prove asymptotic normality of T 2 {cy, X), 
which by Equation (C.l) implies asymptotic normality of T 2 {y,X). The same argument holds for T 3 , and so 


without loss of generality, in the remainder of the proof we assume 9^ and are both bounded. We have assumed 
max{ 02 ^o- 2 } > 0 , as the case 9^ = a'^ = 0 is immediately identifiable because y = 0 , and trivial. 

Recall that because V is Haar-distributed, 
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where u ~ N{0,lp). From this, we can rewrite T 2 as: 

2 


a2 , ^2^ 


n / \ i n 

T2-¥.{T2)=y^wA^/\,e |^),- +£z) -Vra.(A,0- 

^ V M\/y/p ) ^ 

^ , _ \ 2 / 2 ^ 

• (^VAl6>Ui +ei + £* (||m||/^- l)j “ (^^ + 


MV/P 


MV/p 


+ 


1 


Ip I|w|IV-P 




/ _ V. 2 

Wj (VA»6>n, + £,) -y^ Wj {Xj 


9^ + a^) 


\uP/p 
2 a^ 


n 

l\\/y/p- 1) X! Wi (^^/\6Uiei + £■ - 


z=l 


+ 


M 


M 


+ 1 - 


-j- {M\/vp-^)y2^^ 

z^l 

1 "" 

p— (||m||/v^ - if J2 W, (e? - a^) 

' 2=1 

2 ^ 

n 

yy Wi {Xi 


\uP/p 


f + a^) 


(C.2) 

(C.3) 

(C.4) 

(C.5) 

(C. 6 ) 

(C.7) 


Our goal is to show that the right-hand side of (C.2) converges to Gaussian, while ( [C3t -( |C.7[ ) each converge to zero 
in probability. In particular, using certain probabilistic properties of the Ai’s and Wi’s (which are independent of the 
other random variables), we will show convergence conditional on the Xi and Wi. We first prove the result for T 2 and 
then explain the (minor) changes needed to prove the same for T 3 (for which Equation (C.2)-(C.7| also holds). 

Before either, however, we need a few tools, including the following Lemma: 


Lemma 1. For both T 2 and T 3 , there exist constants a and b such that, 

a bXi 


Vi, n\wi\ < 


Proof. We defer the proof to the end of this section. 


min{A2,l} 


1 . 


Note that by convergence of the moments of the Xi to those of the Marcenko-Pastur (MP) distribution, Lemma[^ 
implies that 


kipAC G Op{n ^) 


for any r G K. Note also that by the Cauchy-Schwarz inequality. 




2=1 


eLia: 


1 


2(l-r) 


W" X 


2(l-r) 


G VLp{n ) 


(C. 8 ) 


(C.9) 


for any r G K. Finally, note that |jM||/^/p —^ 1. 

Starting from the bottom, (C.71 converges in probability to zero because ^1 — Er=i (Ai 6 *^ + cr^) 

0^, a constant. (C. 6 ) and (C.4) equal zero because Er=i ~ 

In (C.5), we seek to show that Yfi=i ~ converges in distribution, so that by Slutsky’s Theorem, (C.5) 

converges to zero in probability. The summands Wj {el — are independent and mean-zero with variance 2cr^w|. 
By Lyapunov’s central limit theorem ( Billingsle^ |1995] p. 362), we just need to establish the Lyapunov condition: 

t t 2 2^^^3/2 2^3/2 ^ 

(E^=l Var {w, {ef - a^))) <) 

where the G follows from and 
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In (C.3), we similarly seek to show that the sum converges in distribution, allowing us to again use Slutsky’s 
Theorem to show (C.31 converges to zero in probability. The argument is nearly the same as that for (C.5), using 
various different values of r in (C.81 and (C.9|) to establish the Lyapunov condition. 


Lastly for (Q, by Slutsky’s Theorem ([Lehman and Romano[|2005[ p. 433), it suffices to show that {^sfXiOui + Si) — 


Sr=i converges in distribution to a Gaussian random variable, which can again be established using 

Lyapunov’s central limit theorem in nearly the same way as in the argument for (C.5). Note that the resulting 

variance expression Yl'i=i = Yl'i=i 2 is not identical to the variance of S in 

(2.4), but the quotient of the two expressions converges to 1 as n,p —oo. 

Only a few changes to the above proof are needed for establishing asymptotic normality of T3. First, an analogue 
to Equation (C.9) can be shown: 


E 


> 


E n \— 2 r x- 

2 ^1=1 


3^ e Op(n 1). 


(C.IO) 


Next, in each of (C.4),(C.6), and ( C.7| ), the sum equals a constant while the coefficient in front of the sum 
converges in probability to zero. The arguments for (C.2), (C.3), and (C.5) take the same form as for T 2 except 
using (C.8) and ( jC.lO ) instead of (C.9) to establish the Lyapunov condition. □ 

Proof of Lemma [7[ We start by slightly rewriting the optimization program Vi : 


argmin t such that > <t, = 0, > WiXi = 1. 


(C.ll) 


i=l 


i=\ 


i=l 


i=l 


By the Karush-Kuhn-Tucker conditions for (C.ll), the gradient of the Lagrangian with respect to (t, Wi,..., w„) 
vanishes, i.e., 

1 - - ^2 = 0, (C.12) 

Wi (261 + 2(52A^) + Avi + K 2 Xi = 0, (C.13) 

where > 0 and (52 > 0 are the dual variables corresponding to the inequalities and ki and K 2 are the dual variables 
corresponding to the equalities. Rearranging Equation (C.13), 


-Kl - K2Xi 


2 di 


262X1 ■ 


(C.14) 


By Equation (C.12) and dual positivity constraints, we have 61,62 € [0,1]. Observe that 

+ 62X1 = min{Ai, 1 }, 


mm 

(5i G [0,1], (52=1—<5i 


establishing a lower-bound on the denominator. Now it suffices to show that |«;i|, \k 2 \ € Op(l/n). 
Multiplying Equation (C.13) hy Wi and summing over i, 


n 


n 


2^1 ^ wl + 262 ^ w^Xl + Ki E Wi + K2 ^ WiXi = 0 . 


i=l 

,,2 


i=l 
„2 \2l 


(C.15) 


By recalling that Equation (2.11) established that max{^"^^ ^ Op(l/n) and the constraints ~ 

0 and WiXi = 1, we have that |k 2| & Op{l/n). Next, by just summing Equation (|C.13 ) over i, 


26 i ^ Wi - 1 - 262 ^ WiXl + riKi + K2 ^ Ai = 0 . 


(C.16) 


i=l 




i=l 


By Cauchy-Schwarz, YJi=i WiXf < \/YlJi=i wl Z]r=i ^ Op{l), and using that |k 2| € Op{l/n) and Y1h=i Ai € Op(n), 
we find that |«:i| € Op{l/n) and the Lemma is proved for T 2 . 

To see the same result for T3, first note that rewriting V 2 analogously to (C.ll) gives the same gradient for the 
Lagrangian, so that Equations (C.12) and (C.13) still hold with the same implications for the denominator of Wi in 
Equation (C.14), so all that remains is again showing that |«:i|, |«:2| S Op{l/n). 
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We will need an analogue to Equation ( |2.11[ ) for to show that niax{^”^^ } G Op{l/n). The 


proof of Equation (2.11) can be found in Appendix [P] and follows from the construction of a simple set of weights 
Wi satisfying the constraints of Vi. By considering instead the set of weights 


v-l 


E n/2 1 

i=i ■^3 ~ ^j='‘ 


!./2+1 W 


for i < n/2, 
for i > n/2, 


satisfying the constraints of V 2 , one can follow the same steps to establish max{^.^^ '>^iyJ2i=i } G 

T 3 . Using this and the constraints of 7 ^ 2 , Equation (C.15) establishes |ki| G Op{l/n). Using this result and the same 

methods as for T 2 , Equation (C.16) establishes \k 2 \ G Op(l/n), and the Lemma is proved. □ 


D Variance upper-bound for T2 


In this section we derive the upper bound (2.11) on the variance of the statistic T 2 . For simplicity we assume that 
n is even. 


We begin by constructing a vector of weights w-. 

1 

Wi := 


j + 1 , for i < n/2, 
E;^iA, -E”=„/ 2 +iA/ 1 - 1 . for *> n/ 2 . 


Note that w satisfies the constraints of the optimization problem (2.8), and thus Var(T 2 ) is upper-bounded by 
Equation (2.7) with w plugged in. A second key observation is that we know from random matrix theory that for 
n,p —>• 00 and n/p —>• 7 G (0,1), the distribution of rescaled eigenvalues, \i, converges to the MP distribution with 
parameter 7. 

Recalling the definitions of A^, given in (2.10), this implies that 


1 ” 

^ ^ Ai • (li<„/2 — li>n/2) —t A^ 


i=l 


and 


- At ^ . 

i—1 

Together with the definition of w, these imply that as —>■ 00 with n/p —>■ 7 G ( 0 , 1 ), 

n ■ n n ■ n 1 


^ V =- 

* f [1 vn 




n 


E»’’^f = 7-7T 




2 = 1 


{«• [AELiAz-(lt<„/2-W 2 )]} Al/ 


n ■ n ■ Bry Bj 
^ A/’ 


which in turn implies 


VVar(T2) 

02 -I- (t2 


< 


\ 


o f ^ ~2 ^ -2^2^ f 1 

2 • max n 2_^ wf, n 2_^ J —^ v 2 • max —, 


2=1 2=1 




E Proof of Theoreui 


Proof. For this proof, we use Le Cam’s method (see e.g. Yu (1997 Lemma 1)), which states that 

Pz,.Po mz) = 1] + Pz..p, [^{Z) = 0] > 1 - \\Po - PiIItv , 

where IHIjv is the total variation norm: 

||Po - PiIItv = sup \Vz^Po {z & A) - {Z G A)| , 

ytCB" 


(D.l) 
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where the supremum is taken over Lebesgue-measurable sets. 
We begin by constructing a related distribution Qi: 


W = 0-DV^a-r + a-e , (E.l) 

where 6 = a = and where r ~ Xply/P is independent from V, e. We will bound 

ll-fo ~ -PiIItV — 11-^0 ~ QiIItv + ll-^l ~ QiIItv ■ 

First, we use the fact that r concentrates tightly near 1 for the following bound: 

E (|| 6 » • DV^a -r-e- DV^ a\\^) = E [E (|| 6 l • DV^ a -r-O- DV^ a\\^ \ r, F)] 

= E(||0. WTa|| 2 -|r-l|) 

< 6 » • y^E (aTFI> 2 yT[(r - 1 ) 2 ] 

= 0 • y^E {a^VDW^a) ■ ^^E (y^) - 2^^ • E {xp) + P 
Using the fact that E (y^) = p and E {xp) ^ y/P ~ 4^5 and that E ^ for each t = 1,..., n, 

E (||e. r>vT„ -- 9. dvt„||^) < 9. ^ 

_ 0 ^ 

~Wp' 

since ^ ^ ■ D\ = n. Next, for any measurable set ^ C M”, we have 

IPw^Qi {W gA)-Fz^p,(Z gA)\ 

= \E[¥ {W e A \ r,V) -V {Z e A \ r,V)]\ 

<E[\V{W GA\r,V)-V{Z GA\r, V)\] 

= E [|P {6 ■ DV^a ■r + a-eGA\r,V)-V{e- DV^ a + a ■ e G A \ r,V)\] 

< E {E [\\N {0 ■ DV^a ■ r, a^l^) - N {9 ■ DV^a, a^I„) ||^^| r, F] } 

Using the fact that ||F(/x,ct^I„) — N{fi', a^ln)\\j^ < for any fixed 


< E 


9-DV^a-r-9-DV^a ^ 

r,F^ 

rl 

V2Tra'^ 

^ _ 

1 

9y/n 


\/2T:a‘^ 




where the last step uses our calculations above. Since this is true for any ^ C M", and since 9 = a = ^ by 
assumption under the distribution Pi, we have 

Next, we bound ||Po — Qi||jv By Pinsker’s inequality, 

||Po - OiIItv < , 

where KL(-|j-) is the Kullback-Leibler divergence. Note that the distributions Pq and Qi can be reformulated as 

Po : Z, i fV(0,1) 
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and 


Q^:Z,^N[Q 


Ai + 1 


Writing po(') and qi{-) to be the densities of the distributions Pq and Qi, respectively, we have 

' qijzy 
MZ). 


KL(Qi||Po)=Ez^Qi log 


= E 


Z^Qi 


log 


/rr" _1_ 


n n 1_L 

i=l " 


= Ez^Q, \ 


log 


Ai + 1 


Ai + 1 


- 1 


Since Ez^q^ {Zf) = 


2 ^ _ A, + l 


= - E 




Using the fact that log(x) > (x — 1) — 2{x — 1)^ for all x > | 


2 ’ 




A. - 1 _ 2 f + b - 


= I i:(A - D". 


Combining everything, we have 


la - QiIItv < y^KL(Qii!Po) < . 


and so 


ia-^.iiTv<,dEo.-iP + \/$- 


□ 


F CVX code for computing the weight vector 

The following snippet of code was used with MATLAB Version 8.1 (R2013a) and CVX Version 2.1, Build 1085 on a 
64-bit Linux OS. The eigenvalues A^ are represented by the column vector lambda, t corresponds to 2val(Pi), and 
the resulting vector w corresponds to w*. 

cvx_begin 
variable t 
variable w(n) 
minimize t 
subject to 

sum(w) == 0; 
sum(w .* lambda) == 1; 
norm([w; (t/2-l)/2]) <= (t/2+l)/2; 

norm([w .* lambda; (t/2-l)/2]) <= (t/2+l)/2; 
cvx_end 
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Figure 13: Priors from model (G.ll. 


G Bayesian model 


The Bayesian model is given explicitly as follows (M, Z, and e are all independent of one another): 

M ~ Exponential(A), 

Z~iV(0,/p), 

l3=\fMZ, 

1 (G.l) 

^ ~ GammafA, B), 

£ ~iV(0,cr^/p), 

V^X(3 + e, 


where the values for the parameters used were A = (go Q'^ k, Exponential(l/2000)), A = 14, i? = 


p - --- -- - 20 , 000 ’ 

we have used the sha pe/scale parameterization of the Gamma distribution, as opposed to the shape/rate parame¬ 
terization. Figure 13 shows the resulting priors for 0^, cr^, and p = 

Note also that, although not shown, the posteriors achieved under this setup were all unimodal, so that the 
equal-tailed credible intervals were very close to the minimum-length credible intervals. We used equal-tailed credible 
intervals to give fair comparison with the EigenPrism GIs, which are also equal-tailed. The interval widths plotted 
all have nominal coverage of 80%. BCI endpoints were estimated by empirical quantiles of posterior draws from a 
Gibbs sampler, and thus we were able to much more accurately estimate the 10th and 90th percentiles than, say, the 
2.5th and 97.5th percentiles. 


H Construction of correlated-column covariance matrices 

Dense 10% Correlations used a covariance matrix with ones on the diagonal and 0.1’s as all the other entries. 
The Sparse 100 • P% Correlations used alternating P and —P as off-diagonal entries in a correlation matrix, then 
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projected that matrix into the positive semidefinite cone and reset the diagonal entries to 1. The resulting matrix has 
approximately 1/4 of its entries equal to P, 1/2 of its entries equal to 0, and 1/4 of its entries equal to —2xl0“^-(l—P). 


I Processing of NFBC1966 dataset 

Genotype features from the original data set were removed if they met any of the following conditions: 

• Not a SNP (some were, e.g., copy number variations) 

• Greater than 5% of values were missing 

• All nonmissing values belonged to the same nucleotide 

• SNP location could not be aligned to the genome 

• A test rejected Hardy-Weinberg equilibrium at the 0.01% level 

• On chromosome 23 (sex chromosome) 

The remaining missing values were assumed to take the major allele value (thus were coded as O’s in the pre-centered 
design matrix). 

For each trait, further processing was performed on the subjects. Triglycerides, BMI, insulin, and glucose were 
all log-transformed. G-reactive protein was also log-transformed after adding 0.002 mg/1 (half the detection limit) to 
0 values. Subjects were excluded from the triglycerides, HDL and LDL cholesterol, glucose, and insulin analyses if 
they were on diabetic medication or had not fasted before blood collection (or if either value was missing). Further 
subjects were excluded from the triglycerides, HDL and LDL cholesterol analyses if they were pregnant or if their 
respective phenotype measurement was more than three standard deviations from the mean, after correcting for sex, 
oral contraceptive use, and pregnancy. Subjects whose weight was not directly measured were excluded from BMI 
analysis. Of course any missing values in each phenotype were also excluded. 
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